Fig 1: Proportion of variance explained. LDpred-inf with summary statistics from UKBB, BBJ, meta-AFR, and meta-ALL. Points are empirical estimates, and error bars are bootstrap standard errors (n=5,000).
Fig 2: Proportion of variance explained. PRS calculated using linear combinations of PRS weights from LDpred-inf (PRS1, PRS2) and using local ancestry PRS (PRS3). HRS_AFR
Fig 3: Proportion of variance explained. PRS calculated using linear combinations of PRS weights from LDpred-inf (PRS1, PRS2) and using local ancestry PRS (PRS3). PMBB_AFR
## [[1]]
## IID PRS Summary_Stats
## 1: HG00096 -0.8527006 BBJ
## 2: HG00097 0.1212889 BBJ
## 3: HG00099 -0.5494812 BBJ
## 4: HG00100 -0.5839695 BBJ
## 5: HG00101 -0.4089522 BBJ
## ---
## 2544: NA21137 -0.8566551 BBJ
## 2545: NA21141 -1.7621960 BBJ
## 2546: NA21142 -2.0326330 BBJ
## 2547: NA21143 -0.6662671 BBJ
## 2548: NA21144 -1.6150280 BBJ
##
## [[2]]
## IID PRS Summary_Stats
## 1: HG00096 27.05733 GIANT
## 2: HG00097 27.41120 GIANT
## 3: HG00099 27.12162 GIANT
## 4: HG00100 26.67773 GIANT
## 5: HG00101 26.31729 GIANT
## ---
## 2544: NA21137 18.69068 GIANT
## 2545: NA21141 18.08475 GIANT
## 2546: NA21142 18.15647 GIANT
## 2547: NA21143 18.39023 GIANT
## 2548: NA21144 18.03324 GIANT
##
## [[3]]
## IID PRS Summary_Stats
## 1: HG00096 0.2384030 META_AFR
## 2: HG00097 0.3688721 META_AFR
## 3: HG00099 0.5712746 META_AFR
## 4: HG00100 0.4977523 META_AFR
## 5: HG00101 0.5995154 META_AFR
## ---
## 2544: NA21137 0.3496227 META_AFR
## 2545: NA21141 0.2533668 META_AFR
## 2546: NA21142 0.2311690 META_AFR
## 2547: NA21143 0.2501532 META_AFR
## 2548: NA21144 0.2052742 META_AFR
##
## [[4]]
## IID PRS Summary_Stats
## 1: HG00096 0.05689507 META_AFR2
## 2: HG00097 0.08622545 META_AFR2
## 3: HG00099 0.09906398 META_AFR2
## 4: HG00100 0.11491700 META_AFR2
## 5: HG00101 0.12450070 META_AFR2
## ---
## 2544: NA21137 0.06857799 META_AFR2
## 2545: NA21141 0.03008647 META_AFR2
## 2546: NA21142 0.02576827 META_AFR2
## 2547: NA21143 0.04221819 META_AFR2
## 2548: NA21144 0.03798021 META_AFR2
##
## [[5]]
## IID PRS Summary_Stats
## 1: HG00096 0.2550199 META_ALL
## 2: HG00097 0.4226962 META_ALL
## 3: HG00099 0.4593673 META_ALL
## 4: HG00100 0.5827615 META_ALL
## 5: HG00101 0.4737461 META_ALL
## ---
## 2544: NA21137 0.4528811 META_ALL
## 2545: NA21141 0.2801209 META_ALL
## 2546: NA21142 0.1636204 META_ALL
## 2547: NA21143 0.3484188 META_ALL
## 2548: NA21144 0.4033137 META_ALL
##
## [[6]]
## IID PRS Summary_Stats
## 1: HG00096 0.7188846 META_ALL2
## 2: HG00097 0.8236255 META_ALL2
## 3: HG00099 0.8149656 META_ALL2
## 4: HG00100 0.9149501 META_ALL2
## 5: HG00101 0.8167783 META_ALL2
## ---
## 2544: NA21137 0.6600387 META_ALL2
## 2545: NA21141 0.4850991 META_ALL2
## 2546: NA21142 0.4344598 META_ALL2
## 2547: NA21143 0.5892258 META_ALL2
## 2548: NA21144 0.5828353 META_ALL2
##
## [[7]]
## IID PRS Summary_Stats
## 1: HG00096 2.319422 META_EUR
## 2: HG00097 2.580297 META_EUR
## 3: HG00099 2.526723 META_EUR
## 4: HG00100 2.636959 META_EUR
## 5: HG00101 2.565912 META_EUR
## ---
## 2544: NA21137 1.625801 META_EUR
## 2545: NA21141 1.298389 META_EUR
## 2546: NA21142 1.381026 META_EUR
## 2547: NA21143 1.416706 META_EUR
## 2548: NA21144 1.502949 META_EUR
##
## [[8]]
## IID PRS Summary_Stats
## 1: HG00096 0.082338810 META_NEA
## 2: HG00097 0.165794200 META_NEA
## 3: HG00099 0.007156949 META_NEA
## 4: HG00100 0.050496770 META_NEA
## 5: HG00101 0.111681700 META_NEA
## ---
## 2544: NA21137 0.104231000 META_NEA
## 2545: NA21141 0.007553810 META_NEA
## 2546: NA21142 -0.096658380 META_NEA
## 2547: NA21143 0.020338990 META_NEA
## 2548: NA21144 0.014733670 META_NEA
##
## [[9]]
## IID PRS Summary_Stats
## 1: HG00096 -0.02684305 PAGE
## 2: HG00097 0.12634790 PAGE
## 3: HG00099 0.17277890 PAGE
## 4: HG00100 0.07942150 PAGE
## 5: HG00101 0.11524650 PAGE
## ---
## 2544: NA21137 -0.11703410 PAGE
## 2545: NA21141 -0.15028210 PAGE
## 2546: NA21142 -0.33831460 PAGE
## 2547: NA21143 0.04048262 PAGE
## 2548: NA21144 -0.24686130 PAGE
##
## [[10]]
## IID PRS Summary_Stats
## 1: HG00096 -0.5076630000 UKBB
## 2: HG00097 -0.3305627000 UKBB
## 3: HG00099 -0.1356251000 UKBB
## 4: HG00100 0.1636529000 UKBB
## 5: HG00101 -0.0002598876 UKBB
## ---
## 2544: NA21137 0.1901339000 UKBB
## 2545: NA21141 -0.3319316000 UKBB
## 2546: NA21142 -0.2989896000 UKBB
## 2547: NA21143 0.0275398500 UKBB
## 2548: NA21144 0.1834473000 UKBB
Fig 4: 1000 GENOMES
In our previous work, we analysed the factors that drive reduced prediction accuracy of polygenic scores for height in individuals with African ancestry.
We saw that the site frequency spectrum (SFS) and linkage disequilibrium (LD) play a role, but there is also suggestive evidence that marginal effect sizes differ.
In that study we ran a GWAS in ~8,000 individuals with African ancestry from the UKBB. We tested for differences between those marginal effect sizes and European-derived effect sizes, and for correlations of those differences with allele frequency differences. Finally, we implemented ancestry-informed PRSs in the admixed individuals and observed only a very modest improvement in prediction accuracy.
It is possible that this modest improvement was due to our low sample size. So here we use a much larger sample (about 58K African ancestry individuals and 91K total) to explore the potential of ancestry-informed PRSs for height. We also try a larger meta-analysis, with 58K African ancestry individuals and
We also want to see whether fine-mapping index variants with African ancestry data included lets us select SNPs that yield better PRS performance.
Questions:
Do multi-PRS and LA-PRS predict better when using effect sizes from a meta-African analysis? What about a meta-Pan analysis?
How much do GWAS hits overlap among the EUR-only, BBJ-only, and AFR-only GWAS, and combinations of those?
When we select ancestry-specific index variants and then use those in the PRS, does prediction improve?
For now, we are focusing on height only.
We use GWAS summary statistics for height from six sources:
* UKBB_eur: UK Biobank Europeans.
* BBJ: Biobank Japan.
* Uganda Genome Project (UGP): a meta-analysis of Uganda plus 3 other populations from Africa, described in the Uganda Genome Project paper.
* UKBB_afr: the African subset of the pan-UKBB dataset.
* N’Diaye et al. 2011: still the largest height GWAS performed in African ancestry individuals.
* PAGE: a large meta-analysis in which 35% of participants are African American and the remaining participants are mostly of Hispanic/Latino and other minority ancestries.
So our meta-AFR analysis includes: PAGE, N’Diaye, UKBB_afr, and 4 cohorts from UGP.
Our meta-all analysis includes: meta-AFR, UKBB_eur, BBJ.
We performed two meta-analyses:
* meta-AFR: UGP + pan-UKBB (AFR) + the N’Diaye et al. 2011 meta-analysis + the PAGE project. Total of 90,970 individuals (58,488 of African ancestry). See Table 1.
* meta-ALL: UKBB_eur + meta-AFR (previous step) + BBJ (Biobank Japan, N=159,095). Total of 610,453 individuals (58,488 of African ancestry). See Table 1.
Note that both have the same number of African ancestry individuals (N=XX). We performed meta-ALL to check whether a bigger sample size and increased diversity in the discovery cohort would improve predictions.
We ran each meta-analysis with METAL, using one input file for each of the above datasets. We set genomic control correction to “ON”, meaning it is applied to each input file (not to the final values). We used SCHEME STDERR, meaning betas and standard errors are combined. For the meta-AFR analysis, we set AVERAGEFREQ and MINMAXFREQ to “ON” so that METAL tracks large allele frequency differences across datasets, which can suggest allelic mismatch. We only report results for variants with a combined weight of at least 49,781 (meta-AFR) or 590,026 (meta-ALL) individuals, resulting in about 20 million autosomal variants in both datasets.
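As a rough illustration of what a standard-error-based scheme computes for a single variant, the sketch below implements a textbook fixed-effect inverse-variance-weighted combination in Python. It is not METAL's code. The function names, the example effect sizes, and the per-study lambda values are hypothetical, and the rule of inflating standard errors by the square root of lambda only when lambda exceeds 1 is an assumption about how per-file genomic control is applied.

```python
import math

def gc_adjust_se(se, lam):
    """Inflate a standard error by sqrt(lambda); unchanged if lambda <= 1 (assumed rule)."""
    return se * math.sqrt(max(lam, 1.0))

def ivw_meta(study_estimates):
    """Fixed-effect inverse-variance-weighted meta-analysis for one variant.

    study_estimates: list of (beta, se) tuples, one per study, all aligned
    to the same effect allele.
    Returns (combined beta, combined se, z-score).
    """
    weights = [1.0 / se ** 2 for _, se in study_estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, study_estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return beta, se, beta / se

# Hypothetical (beta, se, per-study lambda) values for one variant.
studies = [(0.10, 0.02, 1.00), (0.06, 0.04, 1.21)]
adjusted = [(b, gc_adjust_se(se, lam)) for b, se, lam in studies]
beta, se, z = ivw_meta(adjusted)
print(f"beta={beta:.4f} se={se:.4f} z={z:.2f}")
```

Studies with smaller standard errors dominate the combined estimate, which is why the large European and East Asian datasets carry most of the weight in meta-ALL.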
We inspected the p-value distribution of these meta-analyses using QQ plots, calculated the genomic inflation factor on the final p-values, and performed corrections accordingly.
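For reference, the genomic inflation factor is conventionally the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455). The Python sketch below computes it from two-sided p-values. The function names are hypothetical, and the correction shown (dividing each chi-square statistic by lambda when lambda exceeds 1) is the usual genomic control adjustment, assumed here rather than taken from the analysis.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df

def p_to_chi2(p):
    """Convert a two-sided p-value in (0, 1] to a 1-df chi-square statistic."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; its square is the statistic
    return z * z

def genomic_inflation(p_values):
    """Lambda GC: median observed chi-square over the expected null median."""
    return median(p_to_chi2(p) for p in p_values) / CHI2_1DF_MEDIAN

def gc_correct(chi2_values, lam):
    """Divide chi-square statistics by lambda when lambda > 1 (assumed correction)."""
    scale = max(lam, 1.0)
    return [c / scale for c in chi2_values]
```

A value near 1 indicates little inflation; p-values of exactly 0 must be excluded or floored before calling `p_to_chi2`.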
Most datasets were in the hg19 build, except N’Diaye, which we lifted over from hg18 to hg19. Each of these studies had already applied its own filtering, and there is often not enough information for us to perform our own.
UGP: this is very recent. They filtered for imputation score > 0.3.
pan-UKBB: They filtered for INFO scores > 0.8 and a minimum allele count of 20 in each population. They also provide a True/False flag, “low_quality_AFR”, which we use, retaining only variants for which it is ‘false’. GWAS covariates: Age, sex, Age*sex, Age^2, Age^2*sex, and the first 10 PCs. Height in cm was inverse-normal transformed.
N’Diaye et al.: The genomic control (GC) inflation factor was calculated for each study and used for within-study correction prior to the meta-analysis. The overall lambda they report is 1.064 (which we confirm, see table below), suggesting no inflation in this meta-analysis. Imputation info scores are not available, but the authors filtered for >= 0.3. Betas and SE are in units of z-score.
PAGE: inverse-normal-adjusted residuals for each trait outcome. Info scores are available. The authors had already filtered for info score > 0.4; we were stricter and filtered for > 0.8.
As mentioned, for the meta-analysis summary statistics we only retained positions for which there was information for most individuals in the meta-analysis (20.7 and 23.7 M SNPs for meta-AFR and meta-ALL, respectively).
For UKBB_eur, we retained only SNPs with INFO > 0.8 (11.9 M SNPs) and low_quality_variant=FALSE (15.4 M SNPs). Only autosomal SNPs were analyzed.
For PRS using summary statistics from the UKBB_eur, we used the UKBB_eur (5,000 randomly sampled individuals) imputed data as the LD reference panel. For PRS using the BBJ summary statistics, we used a combination of the 1000G Phase 3 East Asians and UKBB Chinese individuals. For the meta-AFR summary statistics, we used a combination of UKBB_afr and 1000G Phase 3 African ancestry individuals. For the meta-ALL summary statistics, we used a combination of all Phase 3 1000G individuals, the UKBB_eur, UKBB_afr, and UKBB_chi. In all cases, the combined sets were QC’d to include only unrelated individuals (plink --rel-cutoff 0.125) with genotype missingness < 0.85. We further restricted these sets to SNPs with MAF > 0.001. Finally, we removed SNPs with allelic mismatch with the UKBB_EUR summary statistics file and corrected for strand flipping when appropriate.
For test data, we used the Penn Biobank subsets of European American and African American individuals (Table XX), the HRS subsets of European and African Americans, and the UKBB Chinese individuals (Table XX). Individuals with height more than two standard deviations from the sex- and cohort-specific mean were not included (Table 3).
Genotype data from test cohorts was lifted over to hg38 when needed.
PMBB (Penn Biobank): with sets of EUR (7,501) and AFR (9,226) ancestry individuals.
UKB_CHI (UKBB Chinese): a set of 1,504 individuals with Chinese ancestry from the UK Biobank.
HRS (Health and Retirement Study): with sets of EUR (10,486) and AFR (2,322) ancestry individuals.
We visually inspected QQ plots of height residuals for each dataset to check for extreme outliers. Based on this inspection, we restricted PMBB (Figs 3-4 for before and after filtering) and HRS (Figs 5-6 for before and after filtering) samples to those whose residual height was within \(\pm3\) standard deviations of the mean for each sex. For UKB-CHI, no filtering was necessary (Fig 7). Height residuals were obtained for each dataset by regressing height on all covariates and their interactions:
\[height \sim Sex + Age + Age^2 + Sex \times Age + Sex \times Age^2 + p_{EUR} + Sex \times p_{EUR} + Age \times p_{EUR} + Age^2 \times p_{EUR}\]
where \(p_{EUR}\) is the genome-wide average proportion of European ancestry for PMBB_afr and HRS_afr (estimated with RFMix), and the European ancestry component (estimated with unsupervised ADMIXTURE, k=2) for UKB_CHI. For HRS_eur and PMBB_eur, we set \(p_{EUR}\) to 1.
When multiple time points were available for each individual, we retained the one corresponding to the latest height measure and age. All height phenotype data was formatted to be in centimeters.
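The sex-specific residual filter described above can be sketched in a few lines of Python. This is an illustration only: the record layout (dicts with `sex` and `resid` keys) and the function name are hypothetical, and the residuals are assumed to have been computed already from the regression model.

```python
from collections import defaultdict
from statistics import mean, stdev

def filter_residual_outliers(records, n_sd=3.0):
    """Keep records whose residual lies within n_sd SDs of the mean for their sex.

    records: list of dicts with keys 'sex' and 'resid' (hypothetical layout).
    Each sex group needs at least two records so that an SD can be computed.
    """
    by_sex = defaultdict(list)
    for rec in records:
        by_sex[rec["sex"]].append(rec["resid"])

    # Lower and upper bounds computed separately within each sex.
    bounds = {}
    for sex, values in by_sex.items():
        center, spread = mean(values), stdev(values)
        bounds[sex] = (center - n_sd * spread, center + n_sd * spread)

    return [
        rec for rec in records
        if bounds[rec["sex"]][0] <= rec["resid"] <= bounds[rec["sex"]][1]
    ]
```

Because the bounds are computed from the unfiltered data, a single pass is shown; whether the filter was applied once or iteratively is not stated in these notes.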
Each test cohort was randomly divided into a “train” and a “test” set at a ratio of 0.15 (train) to 0.85 (test) for most datasets, except for UKB_CHI and HRS_afr, where we used 0.20:0.80 (Table 2). We performed a stratified split of the data using the initial_split function from the rsample R package. We used ‘Sex’ as the stratum, i.e., to keep Sex proportions similar between the training and testing sets (Table 2).
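To make the idea of a sex-stratified split concrete, here is a minimal Python sketch. It is not a port of rsample's initial_split, whose internals may differ; the function name, the record layout, and the seed are hypothetical. Each stratum is shuffled and split separately so that both sets keep roughly the same sex proportions.

```python
import random
from collections import defaultdict

def stratified_split(samples, strata_key, train_prop, seed=42):
    """Split samples into (train, test), sampling within each stratum.

    samples: list of dicts; strata_key: name of the field to stratify on.
    train_prop: fraction of each stratum assigned to the train set.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[strata_key]].append(sample)

    train, test = [], []
    for key in sorted(groups):  # sorted for a reproducible order
        members = groups[key][:]
        rng.shuffle(members)
        n_train = round(len(members) * train_prop)
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test

# Hypothetical cohort: 100 women and 60 men, split 0.15 / 0.85.
cohort = [{"id": i, "Sex": "F" if i < 100 else "M"} for i in range(160)]
train, test = stratified_split(cohort, "Sex", 0.15)
print(len(train), len(test))
```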
We used LDpred for PRS calculations. For UKBB_eur summary statistics, we used the UKBB_eur as the LD reference panel; for BBJ and meta-AFR we used East Asians and Africans from 1000G Phase 3, respectively. We first ran ldpred coord to coordinate the summary statistics, test, and LD datasets. Next we ran the Gibbs sampler. Many values of p did not converge, but typically p=1 and p=0.3 did converge, so we looked at those, as well as the infinitesimal model. See Table
PRS_eur: PRS using effect sizes (\(\beta\)) from UKBB_eur.
PRS_eas: PRS using effect sizes (\(\beta\)) from BBJ (all East Asian).
PRS_afr: PRS using effect sizes (\(\beta\)) from the meta-AFR GWAS.
PRS_all: PRS using effect sizes (\(\beta\)) from the meta-ALL GWAS.
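Each of the scores above is, at its core, a weighted sum of effect-allele counts, with the weights taken from the relevant GWAS (after LDpred reweighting in this analysis). The Python sketch below shows only that final summation step. The function name, variant IDs, and values are hypothetical, and variants missing from either input are simply skipped, which is an assumption.

```python
def polygenic_score(dosages, weights):
    """Sum of weight * effect-allele dosage over variants present in both inputs.

    dosages: dict variant_id -> effect-allele count or dosage (0 to 2).
    weights: dict variant_id -> per-allele effect size (beta).
    """
    shared = sorted(dosages.keys() & weights.keys())  # sorted for a stable sum order
    return sum(weights[v] * dosages[v] for v in shared)

# Hypothetical individual and weights.
individual = {"rs1": 2, "rs2": 1, "rs3": 0}
betas = {"rs1": 0.10, "rs2": -0.20, "rs4": 0.50}
print(polygenic_score(individual, betas))
```

Dosages and weights must refer to the same effect allele, which is why the allele-mismatch and strand-flip checks matter before scoring.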
\[height \sim Sex + Age + Age^2 + p_{EUR}\]
\[height \sim Sex + Age + Age^2 + p_{EUR} + PRS_{eur}\]
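One common way to turn the two nested models above into a "proportion of variance explained" is the partial R², one minus the ratio of the residual sum of squares of the full model to that of the covariate-only model, with a bootstrap over individuals for the standard error. The Python sketch below assumes that definition; these notes do not state the exact statistic used. All function names are hypothetical, and the small least-squares solver is only meant for a handful of covariates.

```python
import random
from statistics import stdev

def ols_rss(covariate_rows, y):
    """Residual sum of squares of y regressed on the covariates plus an intercept."""
    rows = [[1.0] + list(r) for r in covariate_rows]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    coef = [A[i][k] / A[i][i] for i in range(k)]
    return sum(
        (yi - sum(c * v for c, v in zip(coef, row))) ** 2
        for row, yi in zip(rows, y)
    )

def partial_r2(covariate_rows, prs, y):
    """Partial R^2 of the PRS: 1 - RSS(full model) / RSS(covariate-only model)."""
    rss_base = ols_rss(covariate_rows, y)
    rss_full = ols_rss([list(c) + [p] for c, p in zip(covariate_rows, prs)], y)
    return 1.0 - rss_full / rss_base

def bootstrap_se(covariate_rows, prs, y, n_boot=200, seed=1):
    """SD of partial R^2 across bootstrap resamples of individuals."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(partial_r2(
            [covariate_rows[i] for i in idx],
            [prs[i] for i in idx],
            [y[i] for i in idx],
        ))
    return stdev(estimates)
```

Here each row of `covariate_rows` would hold Sex, Age, Age^2, and \(p_{EUR}\) for one individual, and `prs` the corresponding score.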
PRS1_ML (described in Marquez-Luna et al. 2017 and Bitarello & Mathieson 2020)
PRS2_BD - linear combination of PRS described in Bitarello & Mathieson 2020